Abstract. The main aim of this paper is to compare the results of several
methods of prediction with confidence. In particular, we compare the
results of the Venn Machine with Platt’s Method of estimating confidence.
The results are presented and discussed.
1 Introduction

Many machine learning algorithms can perform classification and regression.
However, many of them lack a confidence measure for assessing the risk of
error in an individual prediction.

Sometimes a confidence measure is introduced, but very often it is an ad hoc
one. An example is Platt’s algorithm, developed to estimate confidence for
SVM [1]. We recently developed a set of new machine learning algorithms [2,3]
that not only make predictions but also supply each prediction with a measure
of confidence. More importantly, this measure is valid and based on a
well-developed theory of algorithmic randomness.

The algorithm presented in this paper is the Venn Machine [3], a method that
outputs a prediction together with an interval for the probability that the
prediction is correct. What follows is an introduction to the Venn Machine and
Platt’s Method, then a description of the data used and the results of the
experiments.

1.1 Venn Machine 

Consider a training set of pairs $(x_1, y_1), \dots, (x_{n-1}, y_{n-1})$, each
consisting of an object $x_i$ and a label $y_i$. The set of possible labels is
finite, that is, $y \in Y$. Our task is to predict the label $y_n$ for the new
object $x_n$ and to estimate the likelihood that our prediction is correct.

In brief, the Venn Machine operates as follows. First, we define a taxonomy
that divides all examples into categories. Then, we try every possible label
for the new object. In each attempt, we calculate the frequencies of the labels
in the category into which the new object falls. Collecting these frequencies
over all attempts gives one column per candidate label; the minimum frequency
in a column is called the quality of this column. Finally, we output the label
whose column has the highest quality as our prediction, and we output the
minimum and the maximum frequencies of this column as the interval for the
probability that this prediction is correct.

A taxonomy (or, more fully, a Venn taxonomy) is a function $A_n$, $n \in \mathbb{N}$,
on the space $Z^{(n-1)} \times Z$ that assigns every example to one of finitely
many categories $\tau_i \in T$. Considering $z_i$ as the pair $(x_i, y_i)$,

$$\tau_i = A_n(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n\}, z_i) \tag{1}$$

We assign $z_i$ and $z_j$ to the same category if and only if

$$A_n(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n\}, z_i) = A_n(\{z_1, \dots, z_{j-1}, z_{j+1}, \dots, z_n\}, z_j) \tag{2}$$

Here is an example of the simplest taxonomy, based on 1-nearest neighbour (1NN).
We assign to an example the category equal to the label of its nearest
neighbour, based on the distance between two objects (e.g. Euclidean distance):

$$A_n(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n\}, z_i) = \tau_i = y_j \tag{3}$$

where

$$j = \arg\min_{j = 1, \dots, i-1, i+1, \dots, n} \|x_i - x_j\| \tag{4}$$
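The 1NN taxonomy of equations (3) and (4) can be sketched in a few lines. This is only an illustration: the function name `nearest_neighbour_category` and the toy examples are assumptions, not part of the original experiments, and ties in distance are broken here by taking the first nearest example.

```python
import math

def nearest_neighbour_category(examples, i):
    """Category of example i under the 1NN taxonomy: the label of the
    nearest *other* example in Euclidean distance, as in (3)-(4)."""
    x_i, _ = examples[i]
    best_label, best_dist = None, math.inf
    for j, (x_j, y_j) in enumerate(examples):
        if j == i:
            continue  # the example itself is excluded, as in (4)
        d = math.dist(x_i, x_j)
        if d < best_dist:  # strict comparison: first nearest example wins ties
            best_label, best_dist = y_j, d
    return best_label

# Toy data (hypothetical): objects are points in the plane, labels are "a"/"b".
examples = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.0), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.0), "b"),
    ((0.2, 0.1), "b"),
]
categories = [nearest_neighbour_category(examples, i) for i in range(len(examples))]
```

Note that the last toy example carries label "b" but falls into category "a", because its nearest neighbour is labelled "a": the category depends on the neighbour's label, not on the example's own.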


For every attempt $(x_n, y)$, whose category is $\tau$, let $p_y$ be the
empirical probability distribution of the labels in category $\tau$:

$$p_y(\{y'\}) = \frac{|\{(x^*, y^*) \in \tau : y^* = y'\}|}{|\tau|} \tag{5}$$

This is a probability distribution on $Y$. The set $P_n := \{p_y : y \in Y\}$ is
the multiprobability predictor; it consists of $K$ probability distributions,
where $K = |Y|$.

After all attempts, we get a $K \times K$ matrix $P$. The quality of a column is
its minimum entry; let $j_{best}$ be the column with the highest quality. Then
$j_{best}$ is our prediction, and the interval for the probability that the
prediction is correct is

$$\left[\min_{i = 1, \dots, K} P_{i, j_{best}},\ \max_{i = 1, \dots, K} P_{i, j_{best}}\right] \tag{6}$$
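The whole procedure of this section can be summarised in a short sketch. It assumes that row $i$ of $P$ holds the distribution $p_y$ for the $i$-th attempted label, so that column $j$ collects the frequencies of label $j$ over all attempts. The names `venn_predict` and `nn_taxonomy`, the scalar objects and the toy data are hypothetical, chosen only to keep the example self-contained.

```python
def nn_taxonomy(bag, i):
    """Toy 1NN taxonomy on scalar objects: the category of example i is
    the label of its nearest other example."""
    x_i = bag[i][0]
    others = (k for k in range(len(bag)) if k != i)
    j = min(others, key=lambda k: abs(bag[k][0] - x_i))
    return bag[j][1]

def venn_predict(train, x_new, labels, taxonomy):
    """Return (prediction, (lower, upper)) following equations (5) and (6)."""
    matrix = []
    for y in labels:                          # try every possible label
        bag = train + [(x_new, y)]
        cats = [taxonomy(bag, i) for i in range(len(bag))]
        # labels of all examples sharing the new example's category
        in_cat = [lab for (_, lab), c in zip(bag, cats) if c == cats[-1]]
        # one row of P: empirical label frequencies in that category, eq. (5)
        matrix.append([in_cat.count(lab) / len(in_cat) for lab in labels])
    k = len(labels)
    quality = [min(matrix[i][j] for i in range(k)) for j in range(k)]
    j_best = max(range(k), key=lambda j: quality[j])
    column = [matrix[i][j_best] for i in range(k)]
    return labels[j_best], (min(column), max(column))  # eq. (6)

# Toy data (hypothetical): scalar objects with labels -1 / +1.
train = [(0, -1), (1, -1), (3, -1), (10, 1), (12, 1), (15, 1)]
prediction, interval = venn_predict(train, 13, [-1, 1], nn_taxonomy)
```

On this toy data the attempt with label -1 yields the row (0.5, 0.5) and the attempt with label +1 yields (0, 1), so the column for +1 has the higher quality and the output is +1 with interval [0.5, 1.0].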


1.2 Platt’s Method 

Standard Support Vector Machines (SVM) [3] only output the value of
$\mathrm{sign}(f(x_i))$, where $f$ is the decision function. In this sense the
SVM is a non-probabilistic binary linear classifier. In many cases, however, we
are more interested in the belief that the label should be +1, that is, the
probability $P(y = 1 \mid x)$. Platt introduced a method for estimating
posterior probabilities from the decision function $f$ by fitting a sigmoid to
the SVM output:

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)} \tag{7}$$




The best parameters $A$ and $B$ are determined by maximum likelihood estimation
from a training set $(f_i, y_i)$. We use regularized target probabilities $t_i$
to form the new training set $(f_i, t_i)$, defined as

$$t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2}, & \text{if } y_i = +1 \\[2ex] \dfrac{1}{N_- + 2}, & \text{if } y_i = -1 \end{cases} \tag{8}$$

where $N_+$ is the number of positive examples and $N_-$ is the number of
negative examples. The parameters $A$ and $B$ are then found by minimizing the
negative log-likelihood of the training data, which is a cross-entropy error
function:

$$-\sum_i \left( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \right) \to \min \tag{9}$$

where

$$p_i = \frac{1}{1 + \exp(A f_i + B)} \tag{10}$$
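The fitting step can be sketched as follows. This is a minimal sketch, not the optimiser used in the experiments: it minimises (9) with plain gradient descent, relying on the fact that the derivative of (9) with respect to $A f_i + B$ is $t_i - p_i$. The function names, the learning rate, the number of steps and the toy decision values are all assumptions made for illustration.

```python
import math

def platt_targets(labels):
    """Regularized targets t_i of equation (8); labels are +1 or -1."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1) / (n_pos + 2)
    t_neg = 1 / (n_neg + 2)
    return [t_pos if y == 1 else t_neg for y in labels]

def sigmoid(f, a, b):
    """Equation (10): p = 1 / (1 + exp(A f + B))."""
    return 1.0 / (1.0 + math.exp(a * f + b))

def cross_entropy(scores, targets, a, b):
    """Objective of equation (9)."""
    total = 0.0
    for f, t in zip(scores, targets):
        p = sigmoid(f, a, b)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total

def fit_platt(scores, labels, lr=0.05, steps=2000):
    """Gradient descent on (9), starting from A = B = 0 (an assumption)."""
    targets = platt_targets(labels)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for f, t in zip(scores, targets):
            d = t - sigmoid(f, a, b)  # d(loss)/d(A f + B) for one example
            grad_a += d * f
            grad_b += d
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Toy decision values (hypothetical), not linearly separable.
scores = [-2.0, -1.5, -1.0, -0.3, 0.2, 0.4, 1.0, 1.6, 2.1]
labels = [-1, -1, -1, 1, -1, 1, 1, 1, 1]
a, b = fit_platt(scores, labels)
```

With decision values that increase towards the positive class, the fitted $A$ is negative, so that larger $f$ gives a larger probability of label +1.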

With the parameters $A$ and $B$ we can calculate, using (10), the posterior
probability that the label of each example is +1. In many cases, however, the
probability that the prediction is correct is more useful and easier to compare
with the Venn Machine. In this binary classification problem, an example with
probability $p_i$ has label +1 with likelihood $p_i$ and label -1 with
likelihood $1 - p_i$. We therefore use the complementary probability when $p_i$
is less than the optimal threshold (in this paper we set it to 0.5, as
explained later).
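The conversion from Platt's posterior to a "probability that the prediction is correct" is a one-line rule; a minimal sketch, with the hypothetical helper name `platt_prediction`, might look like this:

```python
def platt_prediction(p_pos, threshold=0.5):
    """Turn P(y = +1 | f) into (predicted label, probability that the
    prediction is correct), using the complementary probability below
    the threshold."""
    if p_pos >= threshold:
        return 1, p_pos
    return -1, 1.0 - p_pos

confident_positive = platt_prediction(0.8)    # predicts +1
confident_negative = platt_prediction(0.25)   # predicts -1 with 1 - 0.25
```

With the threshold at 0.5 the reported probability is never below 0.5, which is what makes it directly comparable with the interval output by the Venn Machine.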


2 Data Sets 

The data sets used in this paper are the Salmonella mass spectrometry data
provided by VLA and the Wisconsin Diagnostic Breast Cancer (WDBC) data from
UCI.

The aim of the study of the Salmonella data is to discriminate Salmonella
vaccine strains from wild type field strains of the same serotype. We analysed
a set of 50 vaccine strains (Gallivav vaccine strain) and 43 wild type strains.
Both vaccine and wild type strains belong to the same serotype, Salmonella
enteritidis.

Each strain was represented by three spots, and each spot produced 3 spot
replicates, giving 9 replicates per strain. Pre-processing was applied to each
replicate and resulted in a representation of each mass spectrum as a vector of
25 features corresponding to the intensities of the most common peaks. The
median was then taken for each feature across replicates of the same strain. In
the data set, label +1 corresponds to vaccine strains and label -1 to wild type
strains. Table 1 shows some quantitative properties of the data set.

Figure 1 plots the class-conditional densities $p(f \mid y = \pm 1)$ of the
Salmonella data. The plot shows histograms of the densities of the data set with
bins 0.1 wide, derived from Leave-One-Out Cross-Validation. The solid line is
$p(f \mid y = +1)$, while the dotted line is $p(f \mid y = -1)$. The plot shows
that this data set is not linearly separable.

Table 1. Salmonella Data Set Features

Number of    Number of     Number of Positive   Number of Negative
Instances    Attributes    Examples             Examples
93           25            50                   43



Fig. 1. The histograms for $p(f \mid y = \pm 1)$ for a linear SVM trained on the
Salmonella Data Set.


The second data set is the Wisconsin Diagnostic Breast Cancer data. Ten
real-valued characteristics are computed for each cell nucleus, and the mean,
standard error and worst value of each are recorded, resulting in 30 features in
the data set. These features are computed from a digitized image of a fine
needle aspirate (FNA) of a breast mass and describe characteristics of the cell
nuclei present in the image. The diagnosis takes two values: label +1
corresponds to Benign and label -1 to Malignant. The data set is linearly
separable using all 30 input features. Table 2 shows some quantitative
properties of the data set.













Table 2. Wisconsin Diagnostic Breast Cancer Data Set Features

Number of    Number of     Number of Positive   Number of Negative
Instances    Attributes    Examples             Examples
569          30            357                  212


3 Empirical Results

We report two experiments that compare the performance of the Venn Machine with
the SVM+sigmoid combination in Platt’s Method.

3.1 Taxonomy Design 

The taxonomy used in both experiments is newly designed and is based on the
same decision function as Platt’s Method.

Let $K_T = |T|$ be the number of categories; the taxonomy is referred to below
as $K_T$-SVM. We train an SVM on the whole data $\{(x_1, y_1), \dots, (x_n, y_n)\}$
and calculate the decision values for all examples. Two examples are put into
the same category if their decision values fall into the same interval; the
intervals are generated depending on $K_T$.

For instance, if $K_T = 8$, the intervals can be $(-\infty, -1.5]$, $(-1.5, -1.0]$,
$(-1.0, -0.5]$, $(-0.5, 0]$, $(0, 0.5]$, $(0.5, 1.0]$, $(1.0, 1.5]$, $(1.5, \infty)$.
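Mapping a decision value to its category is a lookup in a sorted list of interval edges. The sketch below uses the eight example intervals given above; the function name `svm_category` and the sample decision values are assumptions, and how the edges are generated for other values of $K_T$ is not specified here.

```python
from bisect import bisect_left

# Inner edges of the eight example intervals (-inf, -1.5], (-1.5, -1.0], ..., (1.5, inf).
EDGES_8 = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

def svm_category(decision_value, edges=EDGES_8):
    """Index (0 .. len(edges)) of the right-closed interval containing the
    decision value. bisect_left counts the edges strictly below the value,
    so a value equal to an edge stays in the interval that edge closes."""
    return bisect_left(edges, decision_value)

decision_values = [-1.5, -1.4, 0.0, 0.3, 1.5, 2.0]   # hypothetical SVM outputs
categories = [svm_category(v) for v in decision_values]
```

Using `bisect_left` rather than `bisect_right` is what makes the intervals closed on the right, matching the example intervals.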

3.2 Experiments 

The first experiment, on the Salmonella data set, uses a radial basis function
(RBF) kernel in the SVM and a Venn Machine with the 8-SVM taxonomy, since the
Salmonella data set is not linearly separable. The second experiment, on the
WDBC data set, uses a linear kernel in the SVM (i.e. the standard SVM) and a
Venn Machine with the 6-SVM taxonomy. The Venn Machine is compared with Platt’s
Method and the raw SVM in terms of accuracy and estimated probabilities.
Assuming equal loss for Type I and Type II errors, the optimal threshold for
Platt’s Method is $P(y = 1 \mid f) = 0.5$. All results in this paper are
obtained using Leave-One-Out Cross-Validation (LOOCV).
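For readers unfamiliar with LOOCV, the evaluation loop can be sketched generically: each example is held out in turn, the predictor is built from the remaining examples, and the held-out prediction is scored. The function names and the toy nearest-neighbour predictor below are hypothetical stand-ins, not the SVM-based predictors used in the experiments.

```python
def leave_one_out_accuracy(examples, fit_predict):
    """Fraction of examples predicted correctly when each one is held out
    in turn. fit_predict(train, x) must return a label for object x."""
    correct = 0
    for i, (x, y) in enumerate(examples):
        train = examples[:i] + examples[i + 1:]   # all examples except the i-th
        if fit_predict(train, x) == y:
            correct += 1
    return correct / len(examples)

def nearest_neighbour(train, x):
    """Toy predictor on scalar objects: label of the nearest training example."""
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

# Toy data (hypothetical): the examples at 9 and 10 have different labels,
# so each is misclassified when held out.
examples = [(0, -1), (1, -1), (3, -1), (9, -1), (10, 1), (12, 1), (15, 1)]
accuracy = leave_one_out_accuracy(examples, nearest_neighbour)
```

On this toy data five of the seven held-out predictions are correct.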

Table 3 shows the parameter settings for the experiments. The C value is the
cost parameter of the SVM. The Underlying Algorithm is the algorithm used in
the taxonomy for the Venn Machine and the kernel used in the SVM. The Kernel
Parameter is $\sigma$, the parameter of the RBF kernel.

Table 4 gives the results of the experiments. The table lists the accuracies
and the probabilistic outputs for the raw SVM, Platt’s Method, and the Venn
Machine on both data sets. For Platt’s Method, the probabilistic output is the
average estimated probability that the prediction is correct. For the Venn
Machine, the probabilistic output is the average estimated interval for the
probability that the prediction is correct.





Table 3. Experimental Parameters

Data Set     Task             C   Underlying Algorithm   Kernel Parameter
Salmonella   SVM              1   RBF                    0.05
             Platt’s Method   1   RBF                    0.05
             Venn Machine     1   8-SVM RBF              0.05
WDBC         SVM              1   Linear                 -
             Platt’s Method   1   Linear                 -
             Venn Machine     1   6-SVM Linear           -

Table 4. Experimental Results

Data Set     Task             Accuracy   Probabilistic Outputs
Salmonella   SVM              81.72%     -
             Platt’s Method   82.80%     84.77%
             Venn Machine     90.32%     [83.49%, 91.03%]
WDBC         SVM              97.72%     -
             Platt’s Method   98.07%     96.20%
             Venn Machine     98.24%     [97.22%, 98.27%]


3.3 Results 

Table 5 lists some comparisons between the two methods. As shown in the table,
the Venn Machine obtained better results on both data sets. For the Salmonella
data set, the Venn Machine with the 8-SVM RBF taxonomy achieved a large
improvement in accuracy (7.52 percentage points) over Platt’s Method. In terms
of probabilistic outputs, the Venn Machine output a probability interval that
contains the accuracy, while the probabilistic output of Platt’s Method is 1.97
percentage points higher than its accuracy (84.77% against 82.80%). For the
WDBC data set, the Venn Machine improved accuracy over the raw SVM by 0.52
percentage points, while Platt’s Method improved it by 0.35 percentage points.
In terms of probabilistic outputs, the Venn Machine again output a probability
interval that contains the accuracy, while the probabilistic output of Platt’s
Method is 1.87 percentage points lower than its accuracy.

Sensitivity and specificity are also calculated and shown in Table 5. For the
Salmonella data set, the Venn Machine achieved an outstanding result in
sensitivity, 16.00 percentage points better than Platt’s Method, so the Venn
Machine is clearly better at identifying Salmonella vaccine strains. For the
WDBC data set, the two methods achieved similar results in both sensitivity and
specificity. It is hard to tell which method is better, but the Venn Machine
still shows a slight improvement in both aspects.
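For reference, sensitivity is the fraction of positive examples (label +1) predicted as positive, and specificity is the fraction of negative examples (label -1) predicted as negative. A minimal sketch of the computation, with a hypothetical function name and toy labels rather than the experimental data:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)    # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)   # false negatives
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)  # true negatives
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)   # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels (hypothetical): 4 positive and 4 negative examples.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, 1, 1]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
```

Here three of four positives and two of four negatives are recovered, giving a sensitivity of 0.75 and a specificity of 0.5.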

Another interesting observation is that Platt’s Method performs better on the
linearly separable data set (WDBC in this paper) than on the linearly
non-separable one (the Salmonella data set), while the Venn Machine achieves
good results on both data sets. Experiments on more data sets are needed to
confirm this.


Table 5. Comparisons Between Two Methods

Data Set     Task             Accuracy   Probabilistic Outputs   Sensitivity   Specificity
Salmonella   Platt’s Method   82.80%     84.77%                  76.00%        67.44%
             Venn Machine     90.32%     [83.49%, 91.03%]        92.00%        67.44%
WDBC         Platt’s Method   98.07%     96.20%                  97.52%        99.02%
             Venn Machine     98.24%     [97.22%, 98.27%]        97.53%        99.50%


Table 6 shows several examples from the Salmonella data set as predicted by the
Venn Machine (VM) and Platt’s Method (PM). For each example, the table contains
the true label, the prediction of the Venn Machine with its interval for the
probability that the prediction is correct, and the prediction of Platt’s
Method with its probabilistic output. The table shows that both methods can be
right or wrong. For instance, both methods misclassify wild type strains 2, 4
and 5 and vaccine strain 44.


Table 6. Prediction for Individual Examples in Salmonella Data Set

No.   True    Prediction   Probabilistic Outputs   Prediction   Probabilistic
      Label   of VM        of VM                   of PM        Outputs of PM
1     -1      -1           [88.89%, 100.00%]       -1           95.65%
2     -1      +1           [60.00%, 63.33%]        +1           77.72%
3     -1      -1           [88.89%, 100.00%]       -1           98.49%
4     -1      +1           [60.00%, 63.33%]        +1           63.98%
5     -1      +1           [60.00%, 63.33%]        +1           56.57%
6     -1      -1           [76.19%, 80.95%]        -1           78.57%
7     -1      -1           [76.19%, 80.95%]        -1           71.69%
44    +1      -1           [80.95%, 85.71%]        -1           71.92%
45    +1      +1           [90.00%, 93.33%]        +1           96.91%
46    +1      +1           [90.00%, 93.33%]        +1           77.96%
47    +1      +1           [56.67%, 60.00%]        -1           61.81%
48    +1      +1           [56.67%, 60.00%]        -1           58.43%
49    +1      +1           [90.00%, 93.33%]        +1           96.15%
50    +1      +1           [90.00%, 93.33%]        +1           94.12%













4 Conclusion 


From our experience with these data sets we observe the following. Platt’s
estimate of the accuracy of prediction can be too optimistic or too
pessimistic, while the Venn Machine’s bounds estimate it more correctly: a
two-sided estimate is safer than a single one. As for the accuracy itself, when
Platt’s Method and the Venn Machine are based on the same kind of SVM, the
accuracy of the Venn Machine is also a bit better. This may be because the Venn
Machine does not rely on a fixed transformation of the SVM output, but makes
its own transformation for each taxonomy, based on the actual data set.

We applied different probabilistic approaches to the data set of Salmonella
strains. As can be seen from Figure 1 and Table 6, this data set is hard to
separate: there are few errors in the class +1, but a large part of the
examples from the class -1 seem to be hardly distinguishable from the class +1.
This is why in this case we need an individual assessment of prediction
quality: being unable to make a confident prediction on every example, we can
still select some of them where our prediction has a higher chance of being
correct.

The results have been observed on two particular data sets. We plan to conduct
experiments on bigger data sets. Another possible direction is to compare the
Venn Machine and Platt’s Method theoretically.